Import Necessary Packages

In [3]:
# Import modules
import os
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# scikit-learn: models, preprocessing, model selection, metrics
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, max_error, r2_score
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

warnings.filterwarnings("ignore")

%matplotlib inline
pd.set_option('display.max_rows', 127)

Experiment for Loan Statistics Dataset

In [4]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))
Ready to use Azure ML 1.36.0 to work with ml_practicing

1 Data Understanding

In [5]:
from azureml.core import Experiment

# Create an Azure ML experiment in your workspace
experiment = Experiment(workspace=ws, name="LoanStats-workspace")

# Start logging data from the experiment, obtaining a reference to the experiment run
run = experiment.start_logging()
print("Starting experiment:", experiment.name)
print('\n')

# Load the data from a local file
loans = pd.read_csv("LoanStats.csv")

# Count the rows and log the result
row_count = len(loans)
run.log('row count', row_count)
print('Analyzing {} rows of data'.format(row_count))

# Count the columns and log the result
column_count = loans.shape[1]
run.log('column count', column_count)
print('Analyzing {} columns of data'.format(column_count))
print('\n')

# Read the first 5 rows
print('First 5 rows in the dataset:')
print(loans.head())
print('\n')

# Check Datatypes
print('Data type of each column:')
run.log('data types of each column', loans.dtypes)
print(loans.dtypes)
Starting experiment: LoanStats-workspace


Analyzing 39786 rows of data
Analyzing 7 columns of data


First 5 rows in the dataset:
   int_rate  loan_amnt  term grade home_ownership  annual_inc         purpose
0     10.65       5000    36     B           RENT     24000.0     credit_card
1     15.27       2500    60     C           RENT     30000.0             car
2     15.96       2400    36     C           RENT     12252.0  small_business
3     13.49      10000    36     C           RENT     49200.0           other
4     12.69       3000    60     B           RENT     80000.0           other


Data type of each column:
int_rate          float64
loan_amnt           int64
term                int64
grade              object
home_ownership     object
annual_inc        float64
purpose            object
dtype: object

1.1 Descriptive Statistics for Numeric and Categorical Features

  • There is no feature that has zero variance.
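One way to verify the zero-variance claim is to count distinct values per column; a minimal sketch on a toy frame (the column names here are illustrative, not from LoanStats):

```python
import pandas as pd

# Toy frame: 'const' has zero variance, the other columns vary
df = pd.DataFrame({
    'const': [1, 1, 1, 1],
    'amount': [500, 2400, 3000, 10000],
    'grade': ['A', 'B', 'B', 'C'],
})

# A feature has zero variance when it holds a single distinct value
zero_variance = [col for col in df.columns if df[col].nunique() == 1]
print(zero_variance)
```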
In [6]:
# Log summary statistics for numeric columns
num_columns = ['int_rate', 'loan_amnt', 'annual_inc', 'term']

summary_stats = loans[num_columns].describe().to_dict()
for col in summary_stats:
    keys = list(summary_stats[col].keys())
    values = list(summary_stats[col].values())
    for index in range(len(keys)):
        run.log_row(col, stat=keys[index], value = values[index])
        
# Check the Descriptive Statistics
loans.describe()
Out[6]:
           int_rate     loan_amnt          term    annual_inc
count  39786.000000  39786.000000  39786.000000  3.978600e+04
mean      12.027873  11231.360277     42.448499  6.897907e+04
std        3.727466   7464.542832     10.638782  6.376263e+04
min        5.420000    500.000000     36.000000  4.000000e+03
25%        9.250000   5500.000000     36.000000  4.050000e+04
50%       11.860000  10000.000000     36.000000  5.900000e+04
75%       14.590000  15000.000000     60.000000  8.234250e+04
max       24.590000  35000.000000     60.000000  6.000000e+06
In [7]:
# Check the Descriptive Statistics for String Data
loans.describe(exclude='number')
Out[7]:
        grade home_ownership             purpose
count   39786          39786               39786
unique      7              5                  14
top         B           RENT  debt_consolidation
freq    12035          18918               18676

1.2 Data Visualizations

  • When the categorical features are analyzed, the grade of the applicant clearly has a direct impact on the interest rate. The purpose and home_ownership features do not demonstrate a clear relation to the interest rate. To judge the impact of loan_amnt, annual_inc, or term, a correlation matrix must be visualized; the plots alone do not suggest a strong relationship.

  • Frequency distributions demonstrate that int_rate, loan_amnt, and annual_inc are not normally distributed. Instead, the distributions are right-skewed, which suggests that a normalization procedure must be followed in the data preparation stage.
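A quick numeric check backs up what the histograms show: the sample skewness is strongly positive for a right-skewed distribution and near zero for a symmetric one. A minimal sketch on synthetic data (the lognormal draw is a stand-in for an income-like column, not the real annual_inc):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal data is right-skewed; normal data is symmetric
right_skewed = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))
symmetric = pd.Series(rng.normal(loc=0, scale=1, size=5000))

# Positive sample skewness indicates a right-skewed distribution
print(right_skewed.skew(), symmetric.skew())
```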

In [8]:
# Visualize the correlations in the dataset by grade
plot = sns.pairplot(loans,hue='grade', diag_kind='kde')

run.log_image(name='Pairplot by grade', plot=plot)
In [9]:
# Check frequency distributions of the numerical features
list_of_features = ['int_rate','loan_amnt','annual_inc','term']

for feat in list_of_features:
    distributions = sns.histplot(data=loans, x=feat, kde=True)
    plt.show()
In [10]:
# Visualize the impact of grade over interest rate
sns.boxplot(x="grade", y="int_rate", data=loans)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3e2f4cf518>
In [11]:
# Visualize the impact of term over interest rate
sns.boxplot(x="term", y="int_rate", data=loans)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3e2f12de10>

1.3 Visualizing the Correlation Matrix

  • The highest correlation appears to be between int_rate and term, at 0.45. A strong correlation, however, normally implies a score higher than 0.75.
  • The lowest correlation appears to be between int_rate and annual_inc, at 0.05. Therefore, it can be said that the annual income of a customer has almost no impact on the interest rate.
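These thresholds can be illustrated on synthetic data, where a moderate and a near-zero correlation are built in by construction (the variable names only mimic the dataset's columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
term = rng.choice([36, 60], size=1000).astype(float)
# int_rate is built to correlate moderately with term; income is independent
int_rate = 0.1 * term + rng.normal(0, 2, size=1000)
annual_inc = rng.normal(60000, 15000, size=1000)

df = pd.DataFrame({'int_rate': int_rate, 'term': term, 'annual_inc': annual_inc})
corr = df.corr()
print(corr.round(2))
```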
In [12]:
# Create the correlation matrix
corr = loans.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set the color palette
cmap = sns.diverging_palette(h_neg=10,
h_pos=240,
as_cmap=True)

# Create the heatmap with the color palette
sns.heatmap(corr, mask=mask,
center=0, cmap=cmap, linewidths=1,
annot=True, fmt=".2f")

plt.show()

2 Data Preprocessing

2.1 Missing Values (Columns & Rows) Detection and Removal

  • For Rows: The threshold for removing rows with missing values is 90%. If more than 90% of a row is missing, it will be removed from the dataset.
  • For Columns: The threshold for removing columns with missing values is 55%. If more than 55% of a column is missing, it will be removed from the dataset.

Findings: There is no missing value in any column or row. Count value for all columns is 39786.
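The two thresholds can be sketched on a toy frame with planted missing values (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, np.nan, np.nan],   # 100% missing -> exceeds 55%
    'c': [1.0, 2.0, np.nan, np.nan],         # 50% missing  -> kept
})

# Drop columns where more than 55% of the values are missing
df = df.loc[:, df.isnull().mean() <= .55]
# Drop rows where more than 90% of the values are missing (the last row)
df = df[df.isnull().mean(axis=1) <= .9]
print(df.shape)
```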

2.1.1 Columns

In [13]:
# Visualize Missing Values
msno.matrix(loans)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3e2ec95be0>
In [14]:
# detect the columns with more than 55% null values 
null_values = loans.loc[:, loans.isnull().mean() > .55]

print('Number of columns BEFORE removal: ' + str(loans.shape[1]))

# list the columns exceeding the 55% threshold and remove them from the dataset
NA_columns = list(null_values.columns)
print('There are {} columns to be removed'.format(len(NA_columns)))

# remove the NA columns
loans.drop(labels=NA_columns, axis=1, inplace=True)

print('Number of columns AFTER removal: ' + str(loans.shape[1]))
Number of columns BEFORE removal: 7
There are 0 columns to be removed
Number of columns AFTER removal: 7

2.1.2 Rows

In [15]:
# flag rows where more than 90% of the values are missing
row_null_frac = loans.isna().mean(axis=1)
to_drop = loans[row_null_frac > .9]

print(' The number of rows where missing values are more than 90%: ' + str(len(to_drop)))

# keep only the rows below the threshold
loans = loans[row_null_frac <= .9]
print(' Remaining number of rows AFTER removal: ' + str(len(loans)))
 The number of rows where missing values are more than 90%: 0
 Remaining number of rows AFTER removal: 39786

2.2 Outlier Detection and Treatment with NaN

  • Here, the applied method for outlier detection is the 3-sigma (z-score) rule, based on the standard deviation of each column. For all numeric columns, upper and lower boundaries are calculated, and detected outliers are replaced with NaN.
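The rule can be sketched on a toy series with one planted outlier; note that with only a handful of points the outlier itself inflates the standard deviation, so a reasonable sample size is used here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# 30 well-behaved values around 10, plus one planted outlier at 100
s = pd.Series(np.append(rng.normal(10, 1, size=30), 100.0))

# mean ± 3 standard deviations define the boundaries
upper = s.mean() + 3 * s.std()
lower = s.mean() - 3 * s.std()

# values outside the boundaries are replaced with NaN
flagged = s.where((s >= lower) & (s <= upper), np.nan)
print(flagged.isna().sum())
```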
In [16]:
def OutlierDetection(df):
    # replace values beyond mean ± 3 standard deviations with NaN (numeric columns only)
    for col in df.columns:
        if df[col].dtype != 'object':
            upper = df[col].mean() + 3*df[col].std()
            lower = df[col].mean() - 3*df[col].std()
            df[col] = np.where((df[col] > upper) | (df[col] < lower), np.nan, df[col])
In [17]:
print('Number of NaNs BEFORE removing the outliers:') 
loans.isnull().sum()
Number of NaNs BEFORE removing the outliers:
Out[17]:
int_rate          0
loan_amnt         0
term              0
grade             0
home_ownership    0
annual_inc        0
purpose           0
dtype: int64
In [18]:
OutlierDetection(loans)
print('Number of NaNs AFTER removing the outliers:')
loans.isnull().sum()
Number of NaNs AFTER removing the outliers:
Out[18]:
int_rate           38
loan_amnt         719
term                0
grade               0
home_ownership      0
annual_inc        289
purpose             0
dtype: int64

2.3 Imputation of NaNs via MICE

  • NaNs in the dataset have to be imputed before running the scaling and one-hot encoding transformations. Multiple Imputation by Chained Equations (MICE) is a robust method for dealing with missing values.
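A minimal sketch of IterativeImputer (scikit-learn's MICE-style imputer) on synthetic data, where the missing column is a near-deterministic function of the observed one, so the imputed values should land close to the true relation:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# y is almost exactly 2x, so the imputer can recover the missing entries
df = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(scale=0.1, size=200)})
df.loc[:9, 'y'] = np.nan  # knock out the first 10 values

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed.isna().sum().sum())
```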
In [19]:
# define a custom function for MICE imputation
def MiceImputation(df):
    MICE_imputer = IterativeImputer(max_iter=10, random_state=0)
    # impute only the numeric columns, then re-attach the categorical ones
    numeric_only = df.select_dtypes(exclude='object')
    categorical_only = df.select_dtypes(include='object')
    imputed = MICE_imputer.fit_transform(numeric_only)
    imputed_df = pd.DataFrame(imputed, columns=numeric_only.columns, index=numeric_only.index)
    return imputed_df.join(categorical_only)
In [20]:
# Apply the imputation function on the numeric columns of loans dataset
loans = MiceImputation(loans)
print('Number of NaNs AFTER MICE imputation:')
loans.isnull().sum()
Number of NaNs AFTER MICE imputation:
Out[20]:
int_rate          0
loan_amnt         0
term              0
annual_inc        0
grade             0
home_ownership    0
purpose           0
dtype: int64

2.4 Separation of Target and Predictor Features

In [21]:
X = loans.drop('int_rate',axis=1)
y = loans['int_rate'] 

2.5 Transforming Numeric-Categorical Features

  • To process numeric columns through min-max scaling and categorical columns through one-hot encoding separately, scikit-learn's ColumnTransformer provides a fast solution. The transformer built here will be reused in the pipelines for the different models later.
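The pattern can be sketched end to end on a toy frame; the column names are illustrative stand-ins for the dataset's features:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    'loan_amnt': [500, 5000, 35000],
    'grade': ['A', 'B', 'A'],
})

# scale the numeric column, one-hot encode the categorical one
pre = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), ['loan_amnt']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['grade']),
])
out = pre.fit_transform(df)
print(out.shape)  # 1 scaled column + 2 one-hot columns
```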
In [22]:
# Classify the Features
categorical_feature_mask = X.dtypes == 'object'
categorical_features = X.columns[categorical_feature_mask].tolist()
numeric_features = X.columns[~categorical_feature_mask].tolist()

# define the transformers separately
numeric_transformer = MinMaxScaler()
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# assign the transformers to classified features
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

2.6 Train-Test Split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=True, random_state=5)

3 Modeling

3.1 Linear Regression

In [24]:
# Instantiate the model
LinearReg = LinearRegression()


# Append the model to preprocessing pipeline.
lr = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", LinearReg)]
)


# fit the  model to the train set
lr.fit(X_train,y_train)

# make predictions on the test set and train set
y_pred_test = lr.predict(X_test)
y_pred_train = lr.predict(X_train)

# Compute and print R^2 and RMSE
print("R^2 for Train Set: {}".format(lr.score(X_train, y_train)))
print("R^2 for Test Set: {}".format(lr.score(X_test, y_test)))
RMSE_test = np.sqrt(mean_squared_error(y_test,y_pred_test))
RMSE_train= np.sqrt(mean_squared_error(y_train,y_pred_train))
print('\n')
print("Root Mean Squared Error for Train Set: {}".format(RMSE_train))
print("Root Mean Squared Error for Test Set: {}".format(RMSE_test))
R^2 for Train Set: 0.9221103058925401
R^2 for Test Set: 0.9210819682035682


Root Mean Squared Error for Train Set: 1.0334064997491457
Root Mean Squared Error for Test Set: 1.0479615374816407

3.2 Inspecting Feature Importances

  • Although the previous correlation matrix visualizations already hinted at this, we need to inspect the feature importances through the model coefficients to reach a concrete decision. Since one-hot encoding is applied to the categorical features, the output of the pipeline has 28 columns, excluding the target variable. According to the findings in this section, the most important factor is grade, followed by term and loan_amnt.
In [36]:
# extract the feature names produced by one-hot encoding
onehot_columns = list(lr.named_steps["preprocessor"].named_transformers_['cat'].get_feature_names(input_features=categorical_features))
all_feature_names = list(numeric_features)
all_feature_names.extend(onehot_columns)

# Zip the coefficients with the full list of feature names
feature_importance = pd.DataFrame(zip(lr[1].coef_, all_feature_names), columns=['coefficient', 'name'])
In [66]:
feature_importance.sort_values('coefficient', ascending=True)
Out[66]:
coefficient name
3 -7.457181 grade_A
4 -3.898663 grade_B
5 -1.389022 grade_C
11 -0.422985 home_ownership_NONE
25 -0.261696 purpose_renewable_energy
15 -0.142881 purpose_car
21 -0.112748 purpose_major_purchase
10 -0.035883 home_ownership_MORTGAGE
27 -0.027168 purpose_vacation
22 -0.002166 purpose_medical
26 0.002883 purpose_small_business
19 0.012713 purpose_home_improvement
28 0.030186 purpose_wedding
17 0.058851 purpose_debt_consolidation
14 0.063382 home_ownership_RENT
16 0.070346 purpose_credit_card
20 0.072259 purpose_house
23 0.086907 purpose_moving
13 0.088257 home_ownership_OWN
18 0.102265 purpose_educational
24 0.110249 purpose_other
2 0.154820 annual_inc
12 0.307229 home_ownership_OTHER
0 0.347106 loan_amnt
1 0.440043 term
6 0.708517 grade_D
7 2.546886 grade_E
8 4.498156 grade_F
9 4.991306 grade_G
In [69]:
import plotly.express as px

# visualize the coefficients
fig = px.bar(feature_importance, x='coefficient', y='name')
fig.show()

3.3 GridSearchCV with Regularization (L1 & L2 Regularized Modelling)

In [70]:
# Cross Validation with LASSO
lasso = Lasso(random_state=42)
las_pip = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", lasso)])

# Cross Validation with RIDGE
ridge = Ridge(random_state=5)
rid_pip = Pipeline(
   steps=[("preprocessor", preprocessor), ("model", ridge)])


pipelines = [las_pip,rid_pip]
param_grid = {'model__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]}

for i in range(len(pipelines)):
  gcv= GridSearchCV(pipelines[i], param_grid , scoring='r2', cv=5)
  gcv.fit(X_train,y_train)
  y_pred = gcv.predict(X_test)
  RMSE = np.sqrt(mean_squared_error(y_test,y_pred))
  print('Used Model: ' + str(pipelines[i][1]))
  print('\n')
  print("Best parameters found: ",gcv.best_params_)
  print("Best R^2 found: {:.3f}".format(np.sqrt(np.abs(gcv.best_score_))))  # note: this prints sqrt(|R^2|), not R^2 itself
  print("Root Mean Square Error: {:.3f}".format(RMSE))
  print('\n')
Used Model: Lasso(random_state=42)


Best parameters found:  {'model__alpha': 0.0001}
Best R^2 found: 0.960
Root Mean Square Error: 1.048


Used Model: Ridge(random_state=5)


Best parameters found:  {'model__alpha': 0.1}
Best R^2 found: 0.960
Root Mean Square Error: 1.048


3.4 GridSearchCV with Stochastic Gradient Descent Regressor

In [129]:
# GridSearchCV with SGDRegressor
sgd= SGDRegressor(random_state=100)

sgd_pip = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", sgd)])

param_grid = {'model__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],'model__eta0': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
              'model__max_iter':[10,100,1000], 'model__loss':['squared_loss'],
              'model__penalty':['l1','l2','elasticnet'],'model__learning_rate':['adaptive','optimal']}

gcv= GridSearchCV(sgd_pip, param_grid , scoring='r2', cv=5)
gcv.fit(X_train,y_train)
y_pred = gcv.predict(X_test)
RMSE = np.sqrt(mean_squared_error(y_test,y_pred))
print('-- Used Model: SGDRegressor() --')
print('\n')
print("Best parameters found: ",gcv.best_params_)
print("Best R^2 found: {:.3f}".format(np.sqrt(np.abs(gcv.best_score_))))  # note: this prints sqrt(|R^2|), not R^2 itself
print("Root Mean Squared Error: {:.3f}".format(RMSE))
print('\n')

ModelsFinal = pd.DataFrame(gcv.cv_results_)

ModelsFinal.sort_values('rank_test_score', ascending = True)
-- Used Model: SGDRegressor() --


Best parameters found:  {'model__alpha': 0.001, 'model__eta0': 1, 'model__learning_rate': 'adaptive', 'model__loss': 'squared_loss', 'model__max_iter': 100, 'model__penalty': 'l1'}
Best R^2 found: 0.960
Root Mean Squared Error: 1.049


Out[129]:
mean_fit_time std_fit_time mean_score_time std_score_time param_model__alpha param_model__eta0 param_model__learning_rate param_model__loss param_model__max_iter param_model__penalty params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
309 0.200597 0.022959 0.008549 0.001358 0.001 1 adaptive squared_loss 100 l1 {'model__alpha': 0.001, 'model__eta0': 1, 'mod... 9.219031e-01 9.210619e-01 9.232884e-01 9.245611e-01 9.181267e-01 9.217882e-01 2.185719e-03 1
312 0.206477 0.025257 0.008754 0.001982 0.001 1 adaptive squared_loss 1000 l1 {'model__alpha': 0.001, 'model__eta0': 1, 'mod... 9.218930e-01 9.210619e-01 9.232884e-01 9.245611e-01 9.181267e-01 9.217862e-01 2.185602e-03 2
61 0.086616 0.002340 0.007806 0.000161 1e-05 0.01 adaptive squared_loss 1000 l2 {'model__alpha': 1e-05, 'model__eta0': 0.01, '... 9.220001e-01 9.211166e-01 9.234264e-01 9.232950e-01 9.181905e-01 9.216057e-01 1.909283e-03 3
58 0.080156 0.002313 0.007835 0.000195 1e-05 0.01 adaptive squared_loss 100 l2 {'model__alpha': 1e-05, 'model__eta0': 0.01, '... 9.220001e-01 9.211166e-01 9.234264e-01 9.232950e-01 9.181905e-01 9.216057e-01 1.909283e-03 3
62 0.103381 0.004421 0.009440 0.001906 1e-05 0.01 adaptive squared_loss 1000 elasticnet {'model__alpha': 1e-05, 'model__eta0': 0.01, '... 9.219990e-01 9.211174e-01 9.234269e-01 9.232936e-01 9.181912e-01 9.216056e-01 1.908820e-03 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
414 0.046115 0.003994 0.008601 0.001714 0.01 1 adaptive squared_loss 10 l1 {'model__alpha': 0.01, 'model__eta0': 1, 'mode... -1.287071e+23 -9.785708e+14 -9.467748e+22 -6.361871e+22 -2.277831e+13 -5.740066e+22 5.119083e+22 644
524 0.042728 0.000648 0.007618 0.000150 0.1 1 adaptive squared_loss 10 elasticnet {'model__alpha': 0.1, 'model__eta0': 1, 'model... -1.680219e+23 -2.549692e+22 -1.180736e+03 -5.263420e+10 -1.221610e+23 -6.313598e+22 6.909996e+22 645
308 0.045466 0.003123 0.008153 0.001368 0.001 1 adaptive squared_loss 10 elasticnet {'model__alpha': 0.001, 'model__eta0': 1, 'mod... -1.916396e+23 -1.242073e+16 -1.580826e+23 -4.660293e+14 -7.005579e+15 -6.994444e+22 8.631885e+22 646
523 0.036481 0.000078 0.007643 0.000261 0.1 1 adaptive squared_loss 10 l2 {'model__alpha': 0.1, 'model__eta0': 1, 'model... -1.388162e+23 -3.804423e+22 -4.498119e+22 -1.047790e+23 -5.844471e+22 -7.701307e+22 3.867809e+22 647
90 0.042441 0.000674 0.007468 0.000196 1e-05 1 adaptive squared_loss 10 l1 {'model__alpha': 1e-05, 'model__eta0': 1, 'mod... -1.719809e+23 -1.573408e+23 -1.798834e+15 8.452655e-01 -1.078347e+23 -8.743129e+22 7.448589e+22 648

648 rows × 19 columns

4 Evaluation

Among the linear regressors trained here, the best R2 result is obtained when LASSO regression is deployed with the following hyperparameter:

  • Alpha: 0.0001

This is further confirmed by running GridSearchCV with a larger hyperparameter grid via SGDRegressor.

Using the above hyperparameters, the best reported score is 0.96 (the square root of the cross-validated R2, which is about 0.92), while the lowest root mean squared error is 1.048. Since our aim is to get the most accurate prediction results for the interest rate column, the R2 score carries more weight in our analyses.
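For reference, the two metrics weighed here can be reproduced on any pair of prediction vectors; a minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 15.0, 9.0])
y_pred = np.array([10.5, 11.5, 14.0, 9.5])

# RMSE is in the units of the target; R^2 is the fraction of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(round(rmse, 3), round(r2, 3))
```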

The most decisive factors for interest rate are: grade, term, and loan amount.

Note: Given the stochastic nature of the algorithms, the results are prone to change on different runs.